Deep Audio-visual Learning: A Survey
Abstract
Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend to leverage these two modalities to improve the performance of previously considered single-modality tasks or to address new challenging problems. In this paper, we provide a comprehensive survey of recent audio-visual learning development. We divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are further discussed. Finally, we summarize the commonly used datasets and challenges.
Similar resources
Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach
Automatic speech recognition (ASR) on video data naturally has access to two modalities: audio and video. In previous work, audio-visual ASR, which leverages visual features to help ASR, has been explored on restricted domains of videos. This paper aims to extend this idea to open-domain videos, for example videos uploaded to YouTube. We achieve this by adopting a unified deep learning approach...
Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition
We propose a transfer deep learning (TDL) framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality. Specifically, we show that we can leverage speech data to fine-tune the network trained for video recognition, given an initial set of audio-video parallel dataset within the same semantics. Our approach first learns the analogyp...
Comparing the Impact of Audio-Visual Input Enhancement on Collocation Learning in Traditional and Mobile Learning Contexts
This study investigated the impact of audio-visual input enhancement teaching techniques on improving English as a Foreign Language (EFL) learners' collocation learning as well as their accuracy concerning collocation use in narrative writing. In addition, it compared the impact and efficiency of audio-visual input enhancement in two learning contexts, namely traditional and mo...
Deep Visual Domain Adaptation: A Survey
Deep domain adaption has emerged as a new learning technique to address the lack of massive amounts of labeled data. Compared to conventional methods, which learn shared feature subspaces or reuse important source instances with shallow representations, deep domain adaption methods leverage deep networks to learn more transferable representations by embedding domain adaptation in the pipeline o...
Perceptual audio loss function for deep learning
PESQ, Perceptual Evaluation of Speech Quality [5], and POLQA, Perceptual Objective Listening Quality Assessment [1], are standards comprising a test methodology for automated assessment of voice quality of speech as experienced by human beings. The predictions of those objective measures should come as close as possible to subjective quality scores as obtained in subjective listening tests, us...
Journal
Journal title: International Journal of Automation and Computing
Year: 2021
ISSN: 1751-8520, 1476-8186
DOI: https://doi.org/10.1007/s11633-021-1293-0